Making massive probabilistic databases practical

Authors

Andrei Todor

Alin Dobra

Tamer Kahveci

Christopher Dudley

Abstract

The existence of incomplete and imprecise data has moved the database paradigm from deterministic to probabilistic information. Probabilistic databases contain tuples that may or may not exist with some probability. As a result, the number of possible deterministic databases that can be instances of a probabilistic database grows exponentially with the number of probabilistic tuples. In this paper, we consider the problem of answering both aggregate and nonaggregate queries on massive probabilistic databases. We adopt the tuple independence model, in which each tuple is assigned a probability value. We develop a method that exploits Probability Generating Functions (PGFs) to answer such queries efficiently. Our method maintains a polynomial for each tuple and incrementally builds a master polynomial that expresses the distribution of the possible result values precisely. We also develop an approximation method that finds the distribution of the result value with negligible error. Our experiments suggest that our methods are orders of magnitude faster than the most recent systems that answer such queries, including MayBMS and SPROUT. In our experiments, we were able to scale up to several terabytes of data on TPC-H queries, while existing methods could handle only a few gigabytes of data on the same queries.
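
The idea of one polynomial per tuple, multiplied into a master polynomial, can be illustrated with a minimal sketch for a SUM aggregate under the tuple independence model. This is not the authors' implementation: the function names (`tuple_pgf`, `multiply`, `sum_distribution`) are hypothetical, values are assumed to be non-negative integers, and the naive polynomial product used here ignores the scalability techniques the paper is about. A tuple with probability p and value v contributes the polynomial (1 - p) + p·x^v, and the coefficient of x^k in the product is the probability that the sum equals k.

```python
from collections import defaultdict

def tuple_pgf(prob, value):
    """PGF of one independent tuple, as a {exponent: coefficient} dict.

    Represents (1 - prob) + prob * x**value. Assumes value is a
    non-negative integer (an assumption of this sketch).
    """
    if value == 0:
        # The tuple never changes the sum, whether or not it exists.
        return {0: 1.0}
    return {0: 1.0 - prob, value: prob}

def multiply(p, q):
    """Naive product of two sparse polynomials."""
    out = defaultdict(float)
    for e1, c1 in p.items():
        for e2, c2 in q.items():
            out[e1 + e2] += c1 * c2
    return dict(out)

def sum_distribution(tuples):
    """Fold the per-tuple polynomials into one master polynomial.

    tuples: iterable of (probability, integer value) pairs.
    Returns {k: P(SUM = k)}.
    """
    master = {0: 1.0}  # empty sum: value 0 with probability 1
    for prob, value in tuples:
        master = multiply(master, tuple_pgf(prob, value))
    return master

# Three uncertain tuples: (existence probability, value).
dist = sum_distribution([(0.5, 1), (0.5, 1), (0.25, 2)])
for k in sorted(dist):
    print(k, dist[k])
```

With n tuples there are 2^n possible worlds, yet the master polynomial here has at most (sum of values + 1) coefficients, which is why the distribution can be computed without enumerating the worlds. A COUNT aggregate is the special case where every value is 1.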

Similar resources

Analytics over Probabilistic Unmerged Duplicates

This paper introduces probabilistic databases with unmerged duplicates (DBud), i.e., databases containing probabilistic information about instances found to describe the same real-world objects. We discuss the need for efficiently querying such databases and for supporting practical query scenarios that require analytical or summarized information. We also sketch possible methodologies and tech...

Extension of Cube Attack with Probabilistic Equations and its Application on Cryptanalysis of KATAN Cipher

Cube Attack is a successful case of Algebraic Attack. Cube Attack consists of two phases, linear equation extraction and solving the extracted equation system. Due to the high complexity of equation extraction phase in finding linear equations, we can extract nonlinear ones that could be approximated to linear equations with high probability. The probabilistic equations could be considered as l...

ProbCons: Probabilistic consistency-based multiple sequence alignment.

To study gene evolution across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein families. Obtaining accurate alignments, however, is a difficult computational problem because of not only the high computational cost but also the lack of proper objective functions for measuring alignment quality. In this paper, we introduce probabilistic consist...

Compiling Relational Database Schemata into Probabilistic Graphical Models

A majority of scientific and commercial data is stored in relational databases. Probabilistic models over such datasets would allow probabilistic queries, error checking, and inference of missing values, but to this day machine learning expertise is required to construct accurate models. Fortunately, current probabilistic programming tools ease the task of constructing such models [1, 2, 3, 4, ...

Ranking and Clustering in Probabilistic Databases

The dramatic growth in the number of application domains that naturally generate probabilistic, uncertain data has resulted in a need for efficiently supporting complex querying and decision-making over such data. In this paper, we address the problem of on-the-fly clustering and ranking over probabilistic databases. We begin with a systematic exploration of ranking in probabilistic databases b...

Journal: CoRR

Volume: abs/1307.0844

Publication date: 2013
